For this project you will take the role of a consultant hired by a real estate investment firm in Ames, Iowa, a Midwestern town in the United States. Your job is to analyze data to help the firm decide how to invest for the highest profits, and to quantify and communicate to company management which types of real estate properties are good investments and why. The firm has provided data on housing sales from 2006 to 2010, containing each house’s characteristics (number of bedrooms, number of bathrooms, square footage, etc.) and its sale price. The codebook for this data set is available online here or in the Data folder in your repo.
It’s generally a bad idea to buy the most expensive house in the neighborhood. And remember the real estate agents’ mantra: Location, location, location! Keep in mind that the goal is to make money for your investors, and hence investing in a property that is overvalued (costing more than it is worth) is rarely a good idea. This means that it’s critical to know which properties are overvalued and which are undervalued. The company that hired you has many questions for you about the housing market. It is up to you to decide what methods you want to use (frequentist or Bayesian) to answer these questions, and implement them to help to identify undervalued and overvalued properties.
You will have three data sets: a subset for training, a subset for testing, and a third subset for validation. You will be asked to do data exploration and build your model (or models) initially using only the training data. Then, you will test your model on the testing data, and finally validate using the validation data. We are challenging you to keep your analysis experience realistic, and in a realistic scenario you would not have access to all three of these data sets at once. You will be able to see on our scoreboard how well your team is doing based on its predictive performance on the testing data. After your project is turned in you will see the final score on the validation set.
All members of the team should contribute equally and answer any questions about the analysis at the final presentation.
For your analysis create a new notebook named “project.Rmd” and update accordingly rather than editing this.
To get started read in the training data.
library(dplyr)
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
library(tidyr)
load("ames_train.Rdata")
print(paste0("The dataset has ", dim(ames_train)[1], " number of observations and ", dim(ames_train)[2], " features"))
[1] "The dataset has 1500 number of observations and 81 features"
#Variables with NA's and their proportion of missing data
miss = apply(is.na(ames_train), 2, sum)
miss_prop = round(miss[miss>0]/nrow(ames_train), 3)
print(miss_prop)
Lot.Frontage Alley Mas.Vnr.Area Bsmt.Qual Bsmt.Cond Bsmt.Exposure BsmtFin.Type.1 BsmtFin.Type.2 Bsmt.Full.Bath Bsmt.Half.Bath
0.188 0.930 0.004 0.034 0.034 0.034 0.034 0.034 0.001 0.001
Fireplace.Qu Garage.Type Garage.Yr.Blt Garage.Finish Garage.Qual Garage.Cond Pool.QC Fence Misc.Feature
0.474 0.046 0.047 0.046 0.047 0.047 0.995 0.789 0.965
which(miss_prop>0.5) # four features have greater than 50% of data "missing" -- drop these variables
Alley Pool.QC Fence Misc.Feature
2 17 18 19
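The missing-data screen above can be sketched on a small example. This is a minimal illustration on a made-up data frame (`toy` and its columns are hypothetical, not part of the Ames data), using `colMeans(is.na(...))` as a compact equivalent of the `apply` call:

```r
# Toy data frame with hypothetical values (not the Ames data)
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c(NA, NA, NA, "x"),
  c = 1:4
)

# Proportion of missing values in each column
miss_prop_toy <- colMeans(is.na(toy))

# Names of columns with more than half of their values missing
drop_cols <- names(miss_prop_toy)[miss_prop_toy > 0.5]

# Keep every column that is not flagged
toy_kept <- toy[, setdiff(names(toy), drop_cols), drop = FALSE]
```

For variables such as Alley or Pool.QC, an NA most likely means the house has no such feature rather than that the value was not recorded, so recoding NA to an explicit "None" level is an alternative to dropping the column.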
Notes about data cleaning:
We dropped “utilities” (type of utilities available) since, in the training set, only 2 observations lacked the full set of utilities (electricity, gas, water and sewage). Intuitively, most modern properties are equipped with these basic public utilities, so keeping the variable is unnecessary.
We also dropped “condition 2” (proximity to various conditions if more than one is present) since it appeared redundant in our training set. Given Condition 1, only 12 properties were not close to normal conditions.
On the original scale, calendar years such as 1990 and 1900 differ little in relative terms. We therefore changed the scale to the number of years since the last construction or remodelling. The code below measures this from the year of sale (Yr.Sold); 2010 is the last year in the dataset.
Another variable dropped from our model was “roof material”, since only 1% of the properties used a material other than “standard composite shingle”. Similarly, “heating” was dropped since more than 95% of the properties have a gas forced warm air furnace (GasA) rather than another type of heating.
For “exterior quality” and “exterior condition”, we recoded these ordinal variables to 1-5 to replace the original scale of conditions (from poor to excellent). Similarly, we recoded “basement exposure” and “basement rating”, except that the new scale starts from 0 for properties without a basement.
Exter Qual (Ordinal): Evaluates the quality of the material on the exterior
Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
Po Poor
Exter Cond (Ordinal): Evaluates the present condition of the material on the exterior
Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
Po Poor
Continuous variables such as “1st floor square feet” and “2nd floor square feet” were log-transformed to aid interpretation.
For variable “functional”, we recoded different ordinal levels into binary levels — typical functionality or not, including minor and major deductions.
We summed up the number of bathrooms to one continuous variable. Note that one half-bathroom would be coded as 0.5.
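The recodings described in these notes can be sketched as follows. This is a minimal illustration on made-up vectors (all names and values here are hypothetical); `match()` is used as one simple way to map the Po/Fa/TA/Gd/Ex scale to 1-5, and the last line mirrors the `Yr.Sold - pmax(Year.Built, Year.Remod.Add)` expression used in the cleaning code:

```r
# Hypothetical quality ratings on the codebook's Po < Fa < TA < Gd < Ex scale
qual <- c("TA", "Gd", "Ex", "Po", "Fa", "TA")

# match() returns each rating's position in the ordered scale: Po = 1, ..., Ex = 5
qual_levels <- c("Po", "Fa", "TA", "Gd", "Ex")
qual_score <- match(qual, qual_levels)

# Total bathrooms for two hypothetical houses, counting each half bath as 0.5
full_bath <- c(2, 1)
half_bath <- c(1, 0)
bsmt_full <- c(1, 0)
bsmt_half <- c(0, 1)
baths <- bsmt_full + 0.5 * bsmt_half + full_bath + 0.5 * half_bath

# Years since construction or last remodel for a house built in 1975,
# remodelled in 1998 and sold in 2009
house_age <- 2009 - pmax(1975, 1998)
```

A rating that does not appear in `qual_levels` would come back as NA from `match()`, which makes unexpected codes easy to spot.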
# Did not remove any NA entries in Lot.frontage
data=ames_train
data <- data %>%
#filter(!is.na(Lot.Frontage)) %>%
mutate(MS.SubClass= factor(MS.SubClass)) %>%
mutate(Alley = factor(Alley, levels = levels(addNA(Alley)), labels = c(levels(Alley), "None"), exclude = NULL)) %>%
mutate(HouseAge = Yr.Sold- pmax(Year.Built, Year.Remod.Add)) %>%
filter(!is.na(Mas.Vnr.Area)) %>%
mutate(Bsmt.YN = 1*(!is.na(Bsmt.Qual))) %>%
mutate(Bsmt.Qual = factor(Bsmt.Qual, levels = levels(addNA(Bsmt.Qual)), labels = c(levels(Bsmt.Qual), "None"), exclude = NULL)) %>%
mutate(Bsmt.Qual = relevel(Bsmt.Qual, ref="None")) %>%
mutate(Bsmt.Cond = factor(Bsmt.Cond, levels = levels(addNA(Bsmt.Cond)), labels = c(levels(Bsmt.Cond), "None"), exclude = NULL)) %>%
mutate(Bsmt.Cond = relevel(Bsmt.Cond, ref="None")) %>%
mutate(Bsmt.Exposure = factor(Bsmt.Exposure, levels = levels(addNA(Bsmt.Exposure)), labels = c(levels(Bsmt.Exposure), "None"), exclude = NULL)) %>%
mutate(Bsmt.Exposure = relevel(Bsmt.Exposure, ref="None")) %>%
mutate(BsmtFin.Type.1= factor(BsmtFin.Type.1, levels = levels(addNA(BsmtFin.Type.1)), labels = c(levels(BsmtFin.Type.1), "None"), exclude = NULL)) %>%
mutate(BsmtFin.Type.1 = relevel(BsmtFin.Type.1, ref="None")) %>%
mutate(BsmtFin.Type.2= factor(BsmtFin.Type.2, levels = levels(addNA(BsmtFin.Type.2)), labels = c(levels(BsmtFin.Type.2), "None"), exclude = NULL)) %>%
mutate(BsmtFin.Type.2 = relevel(BsmtFin.Type.2, ref="None")) %>%
mutate(X12.SF= X1st.Flr.SF+ X2nd.Flr.SF) %>%
filter(!is.na(Bsmt.Full.Bath)) %>%
filter(!is.na(Bsmt.Half.Bath)) %>%
mutate(Baths = Bsmt.Full.Bath + 0.5*Bsmt.Half.Bath + Full.Bath + 0.5*Half.Bath) %>%
mutate(Fireplace.YN = 1*(Fireplaces>0)) %>%
mutate(Fireplace.Qu = factor(Fireplace.Qu, levels = levels(addNA(Fireplace.Qu)), labels = c(levels(Fireplace.Qu), "None"), exclude = NULL)) %>%
mutate(Fireplace.Qu = relevel(Fireplace.Qu, ref="None")) %>%
mutate(Garage.YN = 1*(!is.na(Garage.Cond))) %>%
mutate(Garage.Type = factor(Garage.Type, levels = levels(addNA(Garage.Type)), labels = c(levels(Garage.Type), "None"), exclude = NULL)) %>%
mutate(Garage.Type = relevel(Garage.Type, ref="None")) %>%
mutate(Garage.Finish = factor(Garage.Finish, levels = levels(addNA(Garage.Finish)), labels = c(levels(Garage.Finish), "None"), exclude = NULL)) %>%
mutate(Garage.Finish = relevel(Garage.Finish, ref="None")) %>%
mutate(Garage.Qual = factor(Garage.Qual, levels = levels(addNA(Garage.Qual)), labels = c(levels(Garage.Qual), "None"), exclude = NULL)) %>%
mutate(Garage.Qual = relevel(Garage.Qual, ref="None")) %>%
mutate(Garage.Cond = factor(Garage.Cond, levels = levels(addNA(Garage.Cond)), labels = c(levels(Garage.Cond), "None"), exclude = NULL)) %>%
mutate(Garage.Cond = relevel(Garage.Cond, ref="None")) %>%
mutate(Porch.Area = Wood.Deck.SF+ Open.Porch.SF+Enclosed.Porch+X3Ssn.Porch + Screen.Porch) %>%
mutate(Pool.YN = 1*(Pool.Area>0)) %>%
mutate(Pool.QC = factor(Pool.QC, levels = levels(addNA(Pool.QC)), labels = c(levels(Pool.QC), "None"), exclude = NULL)) %>%
mutate(Pool.QC = relevel(Pool.QC, ref="None")) %>%
mutate(Fence = factor(Fence, levels = levels(addNA(Fence)), labels = c(levels(Fence), "None"), exclude = NULL)) %>%
mutate(Misc.Feature = factor(Misc.Feature, levels = levels(addNA(Misc.Feature)), labels = c(levels(Misc.Feature), "None"), exclude = NULL)) %>%
mutate(Mo.Sold = as.factor(Mo.Sold)) %>%
mutate(Yr.Sold = as.factor(Yr.Sold)) %>%
dplyr::select(-Garage.Yr.Blt) %>%
mutate(Condition.1 = as.character(Condition.1)) %>%
mutate(Kitchen.Qual=plyr::mapvalues(Kitchen.Qual, from = c("Po", "Fa", "TA","Gd", "Ex" ), to = c("1", "2", "3", "4", "5"))) %>%
mutate(Kitchen.Qual = as.numeric(as.character(Kitchen.Qual))) %>%
mutate(Heating.QC=plyr::mapvalues(Heating.QC, from = c("Po", "Fa", "TA","Gd", "Ex" ), to = c("1", "2", "3", "4", "5"))) %>%
mutate(Heating.QC = as.numeric(as.character(Heating.QC))) %>%
mutate(Bsmt.Qual = droplevels(Bsmt.Qual)) %>%
mutate(Functional = droplevels(Functional)) %>%
mutate(Roof.Matl = droplevels(Roof.Matl))
# Simplify Condition 1 (Park, Rail, Normal)
ind_rail<-which(data$Condition.1=="RRNn" | data$Condition.1=="RRAn" | data$Condition.1=="RRNe" | data$Condition.1=="RRAe")
ind_park<-which(data$Condition.1=="PosN" | data$Condition.1=="PosA")
data$Condition.1[ind_rail]<-"Rail"
data$Condition.1[ind_park]<-"Park"
data = data %>%
mutate(Condition.1 = factor(Condition.1)) %>%
mutate(Condition.1 = relevel(Condition.1, ref="Norm"))
# Eliminate the one entry in 'Exposure' that had been left completely empty
data_train<-data
data_train$Bsmt.Exposure[which(data_train$Bsmt.Exposure=="")]<-"None"
data_train$Bsmt.Exposure<-droplevels(data_train$Bsmt.Exposure)
# Shift by 1 so that log(Pool.Area) and log(Total.Bsmt.SF) are defined when the area is zero
data_train$Pool.Area<-data_train$Pool.Area+1
data_train$Total.Bsmt.SF<-data_train$Total.Bsmt.SF+1
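The cleaning code above repeats the same NA-to-"None" recoding for many factors. A small helper could express that pattern once. The sketch below is an assumption for illustration (`na_to_none` is a hypothetical name, not part of the original analysis) and differs slightly from the code above: it sorts levels alphabetically and only creates a "None" level when at least one NA is present.

```r
# Hypothetical helper: turn NA into an explicit "None" level and make it the reference
na_to_none <- function(x, none_label = "None") {
  x <- as.character(x)
  x[is.na(x)] <- none_label
  x <- factor(x)
  # relevel() only when the label is actually present
  if (none_label %in% levels(x)) x <- relevel(x, ref = none_label)
  x
}

# Made-up garage types with one missing entry
garage <- factor(c("Attchd", NA, "Detchd", "Attchd"))
garage_clean <- na_to_none(garage)
levels(garage_clean)

# Applying the helper to several columns of a made-up data frame at once
df <- data.frame(g1 = c("a", NA, "b"), g2 = c(NA, "x", "x"))
df[c("g1", "g2")] <- lapply(df[c("g1", "g2")], na_to_none)
```

Wrapping the recoding in one function also makes it easier to apply identical steps to the training, testing and validation sets.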
The Neighborhood variable, typically of little interest other than to model the location effect, may be of more relevance when used with the map.
We are restricting attention to just the “normal sales” condition.
In the first model you are allowed only limited manipulations of the original data set to predict the sale price. You are allowed to take power transformations of the original variables [square roots, logs, inverses, squares, etc.] but you are NOT allowed to create interaction variables. This means that a variable may only be used once in an equation [if you use \(x^2\), don’t use \(x\)]. Additionally, you may eliminate any data points you deem unfit. This model should have a minimum R-squared of 73% (in the original units) and contain at least 6 variables but fewer than 20.
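Because the R-squared requirement is stated in the original units, the value reported by `summary()` for a model of `log(price)` is not the number to compare against 73%. One way to check is to back-transform the fitted values and compute R-squared by hand. The sketch below uses the built-in `mtcars` data as a stand-in (an assumption for illustration; the variable names are not from the Ames data):

```r
# Built-in mtcars stands in for the housing data (illustration only)
fit_log <- lm(log(mpg) ~ log(wt) + hp, data = mtcars)

# Back-transform the fitted values to the original units
pred_orig <- exp(fitted(fit_log))

# R-squared in the original units: 1 - SSE / SST
sse <- sum((mtcars$mpg - pred_orig)^2)
sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
r2_orig <- 1 - sse / sst
r2_orig
```

Note that `exp()` of a fitted log value estimates the conditional median rather than the mean, so this R-squared can differ noticeably from the one on the log scale.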
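### Performance evaluation function
# Y:    observed values
# Yhat: matrix with columns (point prediction, lower bound, upper bound)
performance <- function(Y, Yhat) {
  err <- Y - Yhat[, 1]
  bias <- mean(err)                  # mean error
  max.dev <- max(abs(err))           # largest absolute error
  mean.dev <- mean(abs(err))         # mean absolute error
  RMSE <- sqrt(mean(err^2))          # root mean squared error
  # share of observations falling strictly inside the interval
  coverage <- mean((Y > Yhat[, 2]) & (Y < Yhat[, 3]))
  out <- data.frame(bias = bias, max.dev = max.dev, mean.dev = mean.dev, RMES = RMSE, Coverage = coverage)
  return(out)
}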
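The function above expects `Yhat` to hold the point prediction in its first column and the interval bounds in the second and third. That is the layout `predict()` returns for an `lm` fit when `interval = "prediction"` is requested. The sketch below shows how such a matrix could be built for a log-scale model and back-transformed, using the built-in `mtcars` data and a made-up train/test split (assumptions for illustration):

```r
set.seed(1)

# Made-up split of mtcars into 22 training rows and 10 test rows
idx <- sample(nrow(mtcars), 22)
train <- mtcars[idx, ]
test <- mtcars[-idx, ]

fit <- lm(log(mpg) ~ log(wt) + hp, data = train)

# predict() returns a matrix with columns fit, lwr, upr;
# exp() maps predictions and bounds back to the original units
Yhat <- exp(predict(fit, newdata = test, interval = "prediction", level = 0.95))
Y <- test$mpg

# The same quantities the performance function reports
rmse <- sqrt(mean((Y - Yhat[, "fit"])^2))
coverage <- mean(Y > Yhat[, "lwr"] & Y < Yhat[, "upr"])
```

With a well-calibrated model the coverage should sit near the nominal level (here 95%), although on a small test set it can vary considerably.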
library(MASS)
Attaching package: ‘MASS’
The following object is masked from ‘package:dplyr’:
select
# Base model with transformed predictors
model=lm(price ~ MS.SubClass + MS.Zoning + log(Lot.Frontage) + log(Lot.Area) + Street + Alley + Lot.Shape + Land.Contour + Lot.Config + Land.Slope + Neighborhood + Condition.1 + Bldg.Type + House.Style + Overall.Qual + Overall.Cond + HouseAge + Roof.Style + Roof.Matl + Exterior.1st + Mas.Vnr.Type + log(1+Mas.Vnr.Area) + Exter.Cond + Exter.Qual + Foundation + Bsmt.Qual + Bsmt.Cond + Bsmt.Exposure + Total.Bsmt.SF + Heating + Heating.QC + Central.Air + Electrical + log(X12.SF) + log(1+Low.Qual.Fin.SF) + Baths + Bedroom.AbvGr + Kitchen.AbvGr + Kitchen.Qual + Functional + Fireplaces + Fireplace.Qu + Garage.Type + Garage.Finish + Garage.Cars + Garage.Cond + Garage.Qual + Paved.Drive + log(1+Pool.Area) + Pool.QC + Fence + Misc.Val + Mo.Sold +Yr.Sold + Sale.Type + TotalSq, data=data_train)
# Boxcox (indicates that log is decent)
l<-boxcox(model)
expo<-round(l$x[which.max(l$y)],2)
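For reference, the Box-Cox step can be reproduced on a small example. The sketch below uses the built-in `mtcars` data (an assumption for illustration) and also extracts an approximate 95% interval for the exponent from the profile log-likelihood. An interval that contains 0 is consistent with a log transformation of the response.

```r
library(MASS)

# Fit on the raw response; y = TRUE and qr = TRUE store what boxcox() needs
fit_raw <- lm(mpg ~ wt + hp, data = mtcars, y = TRUE, qr = TRUE)

# Profile log-likelihood over a grid of exponents, without drawing the plot
bc <- boxcox(fit_raw, lambda = seq(-2, 2, by = 0.05), plotit = FALSE)

# Exponent with the highest log-likelihood
lambda_hat <- bc$x[which.max(bc$y)]

# Approximate 95% interval: exponents within qchisq(0.95, 1) / 2 of the maximum
ci <- range(bc$x[bc$y > max(bc$y) - qchisq(0.95, 1) / 2])
```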
## Current model
## log(Lot.Frontage) currently removed to have more data points (was not included when left in the model with BIC)
model.0=lm(log(price) ~ MS.SubClass + MS.Zoning + log(Lot.Area) + Street + Alley + Lot.Shape + Land.Contour + Lot.Config + Land.Slope + Neighborhood + Condition.1 + Bldg.Type + House.Style + Overall.Qual + Overall.Cond + HouseAge + Roof.Style + Roof.Matl + Exterior.1st + Mas.Vnr.Type + log(1+Mas.Vnr.Area) + Exter.Cond + Exter.Qual + Foundation + Bsmt.Qual + Bsmt.Cond + Bsmt.Exposure + log(Total.Bsmt.SF) + Bsmt.YN+ Heating + Heating.QC + Central.Air + log(1+Low.Qual.Fin.SF) + Baths + Bedroom.AbvGr + Kitchen.AbvGr + Kitchen.Qual + Functional + Fireplaces + Fireplace.Qu + Garage.Type + Garage.Finish + Garage.Cars +Garage.Qual+Garage.Cond + Paved.Drive + log(Pool.Area) + Pool.QC + Fence + Misc.Val + Mo.Sold +Yr.Sold + Sale.Type + log(TotalSq) + Pool.YN , data=data_train)
# There are some perfect collinearities in this model -> eliminate them via AIC/BIC
summary(model.0)
Call:
lm(formula = log(price) ~ MS.SubClass + MS.Zoning + log(Lot.Area) +
Street + Alley + Lot.Shape + Land.Contour + Lot.Config +
Land.Slope + Neighborhood + Condition.1 + Bldg.Type + House.Style +
Overall.Qual + Overall.Cond + HouseAge + Roof.Style + Roof.Matl +
Exterior.1st + Mas.Vnr.Type + log(1 + Mas.Vnr.Area) + Exter.Cond +
Exter.Qual + Foundation + Bsmt.Qual + Bsmt.Cond + Bsmt.Exposure +
log(Total.Bsmt.SF) + Bsmt.YN + Heating + Heating.QC + Central.Air +
log(1 + Low.Qual.Fin.SF) + Baths + Bedroom.AbvGr + Kitchen.AbvGr +
Kitchen.Qual + Functional + Fireplaces + Fireplace.Qu + Garage.Type +
Garage.Finish + Garage.Cars + Garage.Qual + Garage.Cond +
Paved.Drive + log(Pool.Area) + Pool.QC + Fence + Misc.Val +
Mo.Sold + Yr.Sold + Sale.Type + log(TotalSq) + Pool.YN, data = data_train)
Residuals:
Min 1Q Median 3Q Max
-0.43874 -0.04567 0.00024 0.04925 0.25557
Coefficients: (8 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.135e+00 2.287e-01 31.199 < 2e-16 ***
MS.SubClass30 -7.720e-02 1.621e-02 -4.762 2.13e-06 ***
MS.SubClass40 -2.625e-02 5.441e-02 -0.483 0.629528
MS.SubClass45 -1.585e-01 1.216e-01 -1.304 0.192564
MS.SubClass50 -2.953e-02 3.292e-02 -0.897 0.369868
MS.SubClass60 -3.433e-02 2.747e-02 -1.250 0.211688
MS.SubClass70 -7.417e-02 2.987e-02 -2.483 0.013157 *
MS.SubClass75 -1.948e-02 8.094e-02 -0.241 0.809826
MS.SubClass80 -6.554e-02 4.932e-02 -1.329 0.184105
MS.SubClass85 -3.716e-02 3.572e-02 -1.040 0.298403
MS.SubClass90 -2.750e-02 2.873e-02 -0.957 0.338788
MS.SubClass120 3.104e-02 4.763e-02 0.652 0.514768
MS.SubClass150 -1.220e-01 1.210e-01 -1.008 0.313770
MS.SubClass160 -5.059e-02 5.922e-02 -0.854 0.393087
MS.SubClass180 -2.540e-02 8.118e-02 -0.313 0.754415
MS.SubClass190 -7.996e-02 9.993e-02 -0.800 0.423747
MS.ZoningC (all) -8.623e-02 1.100e-01 -0.784 0.433200
MS.ZoningFV 1.407e-01 1.042e-01 1.351 0.176927
MS.ZoningI (all) -2.767e-02 1.291e-01 -0.214 0.830289
MS.ZoningRH 1.659e-01 1.037e-01 1.599 0.110075
MS.ZoningRL 1.231e-01 9.957e-02 1.236 0.216735
MS.ZoningRM 7.809e-02 1.008e-01 0.775 0.438763
log(Lot.Area) 9.002e-02 1.010e-02 8.912 < 2e-16 ***
StreetPave -1.815e-02 4.429e-02 -0.410 0.682128
AlleyPave -1.134e-02 2.326e-02 -0.488 0.625946
AlleyNone 1.801e-02 1.395e-02 1.292 0.196747
Lot.ShapeIR2 -6.732e-04 1.582e-02 -0.043 0.966069
Lot.ShapeIR3 6.258e-03 3.092e-02 0.202 0.839656
Lot.ShapeReg -1.546e-03 5.964e-03 -0.259 0.795449
Land.ContourHLS 2.953e-02 1.852e-02 1.594 0.111086
Land.ContourLow -5.651e-03 2.377e-02 -0.238 0.812139
Land.ContourLvl 2.114e-02 1.375e-02 1.537 0.124571
Lot.ConfigCulDSac 1.419e-02 1.163e-02 1.220 0.222861
Lot.ConfigFR2 -3.046e-02 1.445e-02 -2.108 0.035210 *
Lot.ConfigFR3 -2.514e-02 4.191e-02 -0.600 0.548733
Lot.ConfigInside 1.273e-02 6.572e-03 1.937 0.053009 .
Land.SlopeMod 1.605e-02 1.453e-02 1.104 0.269592
Land.SlopeSev -7.217e-02 5.346e-02 -1.350 0.177255
NeighborhoodBlueste 1.012e-01 4.633e-02 2.184 0.029154 *
NeighborhoodBrDale 1.193e-02 4.322e-02 0.276 0.782511
NeighborhoodBrkSide 2.929e-02 3.587e-02 0.817 0.414264
NeighborhoodClearCr 1.518e-02 3.742e-02 0.406 0.685092
NeighborhoodCollgCr -4.175e-03 3.026e-02 -0.138 0.890275
NeighborhoodCrawfor 6.731e-02 3.434e-02 1.960 0.050179 .
NeighborhoodEdwards -8.750e-02 3.226e-02 -2.712 0.006778 **
NeighborhoodGilbert -4.620e-02 3.155e-02 -1.464 0.143425
NeighborhoodGreens 5.330e-02 4.491e-02 1.187 0.235555
NeighborhoodGrnHill 4.587e-01 7.033e-02 6.522 9.94e-11 ***
NeighborhoodIDOTRR -4.480e-02 3.976e-02 -1.127 0.260077
NeighborhoodLandmrk -8.096e-02 9.648e-02 -0.839 0.401562
NeighborhoodMeadowV -9.830e-02 4.989e-02 -1.970 0.049004 *
NeighborhoodMitchel -1.939e-02 3.247e-02 -0.597 0.550505
NeighborhoodNAmes -5.754e-02 3.165e-02 -1.818 0.069271 .
NeighborhoodNoRidge 7.880e-02 3.380e-02 2.332 0.019881 *
NeighborhoodNPkVill 4.515e-02 4.172e-02 1.082 0.279425
NeighborhoodNridgHt 3.993e-02 3.232e-02 1.235 0.216929
NeighborhoodNWAmes -3.705e-02 3.264e-02 -1.135 0.256547
NeighborhoodOldTown -5.117e-02 3.647e-02 -1.403 0.160835
NeighborhoodSawyer -1.697e-02 3.244e-02 -0.523 0.600886
NeighborhoodSawyerW -3.472e-02 3.147e-02 -1.103 0.270103
NeighborhoodSomerst 7.220e-02 4.009e-02 1.801 0.071939 .
NeighborhoodStoneBr 7.764e-02 3.451e-02 2.250 0.024620 *
NeighborhoodSWISU -4.571e-02 3.748e-02 -1.220 0.222869
NeighborhoodTimber -2.846e-02 3.383e-02 -0.841 0.400336
NeighborhoodVeenker 2.246e-02 3.973e-02 0.565 0.571929
Condition.1Artery -7.044e-02 1.535e-02 -4.589 4.90e-06 ***
Condition.1Feedr -7.402e-02 1.141e-02 -6.486 1.25e-10 ***
Condition.1Park -2.670e-04 1.739e-02 -0.015 0.987752
Condition.1Rail -5.997e-02 1.480e-02 -4.052 5.38e-05 ***
Bldg.Type2fmCon 3.790e-02 9.699e-02 0.391 0.696031
Bldg.TypeDuplex NA NA NA NA
Bldg.TypeTwnhs -4.101e-02 4.983e-02 -0.823 0.410662
Bldg.TypeTwnhsE -3.401e-02 4.657e-02 -0.730 0.465340
House.Style1.5Unf 1.452e-01 1.200e-01 1.210 0.226482
House.Style1Story 4.451e-02 3.137e-02 1.419 0.156180
House.Style2.5Fin -8.497e-02 9.804e-02 -0.867 0.386254
House.Style2.5Unf 4.834e-03 8.411e-02 0.057 0.954181
House.Style2Story 4.421e-02 2.947e-02 1.500 0.133822
House.StyleSFoyer 9.402e-02 4.126e-02 2.279 0.022844 *
House.StyleSLvl 1.099e-01 5.280e-02 2.082 0.037563 *
Overall.Qual 4.386e-02 3.647e-03 12.025 < 2e-16 ***
Overall.Cond 3.438e-02 3.001e-03 11.455 < 2e-16 ***
HouseAge -5.985e-04 1.928e-04 -3.104 0.001954 **
Roof.StyleGable -6.870e-02 5.100e-02 -1.347 0.178219
Roof.StyleGambrel -1.341e-01 5.823e-02 -2.303 0.021454 *
Roof.StyleHip -7.569e-02 5.137e-02 -1.473 0.140909
Roof.StyleMansard -9.772e-02 6.256e-02 -1.562 0.118568
Roof.StyleShed -1.860e-02 9.228e-02 -0.202 0.840325
Roof.MatlMembran 8.715e-02 1.139e-01 0.765 0.444332
Roof.MatlRoll 7.002e-02 9.408e-02 0.744 0.456888
Roof.MatlTar&Grv 3.100e-02 3.754e-02 0.826 0.408998
Roof.MatlWdShake 1.155e-02 4.850e-02 0.238 0.811768
Roof.MatlWdShngl 7.904e-02 5.041e-02 1.568 0.117148
Exterior.1stAsphShn -3.774e-02 9.304e-02 -0.406 0.685059
Exterior.1stBrkComm 1.197e-01 7.003e-02 1.710 0.087573 .
Exterior.1stBrkFace 6.513e-02 2.716e-02 2.398 0.016616 *
Exterior.1stCBlock NA NA NA NA
Exterior.1stCemntBd 4.768e-02 2.816e-02 1.693 0.090731 .
Exterior.1stHdBoard 1.592e-02 2.409e-02 0.661 0.508939
Exterior.1stImStucc -1.087e-02 9.163e-02 -0.119 0.905605
Exterior.1stMetalSd 2.877e-02 2.355e-02 1.222 0.222118
Exterior.1stPlywood 9.545e-03 2.532e-02 0.377 0.706277
Exterior.1stPreCast 3.289e-01 1.013e-01 3.245 0.001204 **
Exterior.1stStucco 1.632e-02 3.006e-02 0.543 0.587327
Exterior.1stVinylSd 3.341e-02 2.395e-02 1.395 0.163211
Exterior.1stWd Sdng 1.428e-02 2.354e-02 0.607 0.544233
Exterior.1stWdShing 2.649e-02 2.844e-02 0.932 0.351759
Mas.Vnr.TypeBrkFace 5.390e-02 2.675e-02 2.015 0.044108 *
Mas.Vnr.TypeNone 1.103e-01 3.638e-02 3.031 0.002488 **
Mas.Vnr.TypeStone 5.822e-02 2.849e-02 2.044 0.041189 *
log(1 + Mas.Vnr.Area) 1.221e-02 5.113e-03 2.387 0.017131 *
Exter.CondFa -2.636e-02 4.346e-02 -0.606 0.544314
Exter.CondGd 2.375e-02 3.777e-02 0.629 0.529609
Exter.CondPo -1.132e-01 1.012e-01 -1.119 0.263270
Exter.CondTA 3.107e-02 3.777e-02 0.823 0.410846
Exter.QualFa -5.306e-02 3.688e-02 -1.439 0.150478
Exter.QualGd -5.349e-02 1.970e-02 -2.715 0.006716 **
Exter.QualTA -5.260e-02 2.179e-02 -2.414 0.015904 *
FoundationCBlock 2.047e-02 1.118e-02 1.831 0.067407 .
FoundationPConc 4.603e-02 1.188e-02 3.873 0.000113 ***
FoundationSlab -6.749e-03 3.072e-02 -0.220 0.826142
FoundationStone 4.719e-02 4.164e-02 1.133 0.257252
FoundationWood 5.419e-02 5.410e-02 1.002 0.316657
Bsmt.QualEx -5.934e-01 1.230e-01 -4.826 1.56e-06 ***
Bsmt.QualFa -6.228e-01 1.225e-01 -5.084 4.25e-07 ***
Bsmt.QualGd -6.587e-01 1.224e-01 -5.383 8.69e-08 ***
Bsmt.QualPo -1.649e-01 1.733e-01 -0.952 0.341390
Bsmt.QualTA -6.586e-01 1.221e-01 -5.395 8.14e-08 ***
Bsmt.CondEx -5.374e-04 6.234e-02 -0.009 0.993124
Bsmt.CondFa -2.399e-02 1.472e-02 -1.630 0.103428
Bsmt.CondGd 1.951e-03 1.303e-02 0.150 0.881017
Bsmt.CondPo 8.582e-02 7.715e-02 1.112 0.266197
Bsmt.CondTA NA NA NA NA
Bsmt.ExposureAv 1.326e-01 8.588e-02 1.544 0.122909
Bsmt.ExposureGd 1.633e-01 8.631e-02 1.892 0.058769 .
Bsmt.ExposureMn 9.901e-02 8.612e-02 1.150 0.250533
Bsmt.ExposureNo 1.089e-01 8.585e-02 1.269 0.204848
log(Total.Bsmt.SF) 9.271e-02 1.217e-02 7.616 5.06e-14 ***
Bsmt.YN NA NA NA NA
HeatingGasA 7.688e-02 9.347e-02 0.823 0.410924
HeatingGasW 1.523e-01 9.678e-02 1.574 0.115838
HeatingGrav 1.256e-02 1.146e-01 0.110 0.912736
HeatingOthW 3.011e-02 1.146e-01 0.263 0.792809
HeatingWall 1.054e-01 1.038e-01 1.015 0.310188
Heating.QC 8.207e-03 3.392e-03 2.419 0.015689 *
Central.AirY 5.164e-02 1.337e-02 3.862 0.000118 ***
log(1 + Low.Qual.Fin.SF) 1.195e-02 4.013e-03 2.977 0.002968 **
Baths 4.982e-02 4.577e-03 10.884 < 2e-16 ***
Bedroom.AbvGr -1.038e-02 4.588e-03 -2.263 0.023781 *
Kitchen.AbvGr -1.030e-01 2.454e-02 -4.199 2.87e-05 ***
Kitchen.Qual 3.426e-02 5.840e-03 5.866 5.66e-09 ***
FunctionalMaj2 -2.494e-01 5.193e-02 -4.804 1.74e-06 ***
FunctionalMin1 -1.625e-02 3.566e-02 -0.456 0.648641
FunctionalMin2 -1.274e-02 3.606e-02 -0.353 0.723946
FunctionalMod -3.512e-02 3.942e-02 -0.891 0.373152
FunctionalTyp 4.746e-02 3.290e-02 1.443 0.149399
Fireplaces 2.516e-02 8.781e-03 2.865 0.004240 **
Fireplace.QuEx 1.185e-02 2.345e-02 0.505 0.613479
Fireplace.QuFa -3.809e-03 1.740e-02 -0.219 0.826753
Fireplace.QuGd 8.029e-03 1.221e-02 0.658 0.510907
Fireplace.QuPo 1.114e-02 2.043e-02 0.545 0.585585
Fireplace.QuTA 1.034e-03 1.211e-02 0.085 0.931945
Garage.Type2Types -2.409e-02 3.354e-02 -0.718 0.472872
Garage.TypeAttchd 7.027e-03 1.677e-02 0.419 0.675292
Garage.TypeBasment -1.302e-02 3.140e-02 -0.415 0.678518
Garage.TypeBuiltIn 1.970e-02 1.991e-02 0.989 0.322625
Garage.TypeCarPort -7.919e-02 5.498e-02 -1.440 0.150039
Garage.TypeDetchd 1.634e-02 1.637e-02 0.998 0.318665
Garage.Finish -7.485e-02 1.519e-01 -0.493 0.622210
Garage.FinishFin 1.826e-02 8.378e-03 2.180 0.029443 *
Garage.FinishRFn -5.550e-03 7.428e-03 -0.747 0.455052
Garage.FinishUnf NA NA NA NA
Garage.Cars 3.427e-02 5.610e-03 6.109 1.33e-09 ***
Garage.QualFa -1.092e-02 1.439e-02 -0.759 0.448201
Garage.QualGd 4.510e-02 2.910e-02 1.550 0.121430
Garage.QualPo -3.117e-01 7.767e-02 -4.013 6.35e-05 ***
Garage.QualTA NA NA NA NA
Garage.CondFa -6.362e-02 1.804e-02 -3.527 0.000435 ***
Garage.CondGd 6.926e-04 4.160e-02 0.017 0.986719
Garage.CondPo 7.258e-02 4.399e-02 1.650 0.099151 .
Garage.CondTA NA NA NA NA
Paved.DriveP -1.649e-02 1.865e-02 -0.884 0.376870
Paved.DriveY 2.848e-02 1.229e-02 2.318 0.020627 *
log(Pool.Area) -2.200e-01 2.224e-01 -0.989 0.322688
Pool.QCEx 1.317e+00 1.152e+00 1.143 0.253103
Pool.QCFa 1.404e+00 1.416e+00 0.991 0.321709
Pool.QCGd 1.446e+00 1.447e+00 0.999 0.317933
Pool.QCTA 1.684e+00 1.362e+00 1.236 0.216777
FenceGdWo 1.227e-03 1.707e-02 0.072 0.942711
FenceMnPrv -1.599e-02 1.353e-02 -1.182 0.237505
FenceMnWw -4.282e-02 3.570e-02 -1.199 0.230655
FenceNone -1.278e-02 1.234e-02 -1.036 0.300596
Misc.Val 8.128e-07 4.966e-06 0.164 0.870024
Mo.Sold2 -9.601e-03 1.705e-02 -0.563 0.573370
Mo.Sold3 -1.316e-02 1.525e-02 -0.863 0.388463
Mo.Sold4 1.720e-02 1.479e-02 1.163 0.245173
Mo.Sold5 1.143e-02 1.415e-02 0.808 0.419406
Mo.Sold6 1.087e-02 1.386e-02 0.785 0.432847
Mo.Sold7 1.317e-02 1.400e-02 0.941 0.347126
Mo.Sold8 1.980e-03 1.537e-02 0.129 0.897511
[ reached getOption("max.print") -- omitted 18 rows ]
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.08396 on 1283 degrees of freedom
Multiple R-squared: 0.9562, Adjusted R-squared: 0.9491
F-statistic: 134 on 209 and 1283 DF, p-value: < 2.2e-16
plot(model.0)
not plotting observations with leverage one:
47, 168, 242, 284, 411, 638, 655, 739, 804, 990, 1011, 1194, 1198, 1324, 1344, 1376
NaNs produced
#AIC
#model.AIC=step(model.0, k=2)
#summary(model.AIC)
#plot(model.AIC)
#BIC
#model.BIC=step(model.0, k=log(nrow(data_train)))
model.BIC=lm(formula = log(price) ~ log(Lot.Area) + Neighborhood + Condition.1 +
Overall.Qual + Overall.Cond + HouseAge + Foundation + Bsmt.Qual +
Bsmt.Exposure + log(Total.Bsmt.SF) + Heating.QC + Central.Air +
Baths + Bedroom.AbvGr + Kitchen.AbvGr + Kitchen.Qual + Functional +
Fireplaces + Garage.Cars + Paved.Drive + log(Pool.Area) +
log(TotalSq), data = data_train)
summary(model.BIC)
Call:
lm(formula = log(price) ~ log(Lot.Area) + Neighborhood + Condition.1 +
Overall.Qual + Overall.Cond + HouseAge + Foundation + Bsmt.Qual +
Bsmt.Exposure + log(Total.Bsmt.SF) + Heating.QC + Central.Air +
Baths + Bedroom.AbvGr + Kitchen.AbvGr + Kitchen.Qual + Functional +
Fireplaces + Garage.Cars + Paved.Drive + log(Pool.Area) +
log(TotalSq), data = data_train)
Residuals:
Min 1Q Median 3Q Max
-0.66831 -0.05150 0.00025 0.05629 0.32251
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.3926554 0.1169000 63.239 < 2e-16 ***
log(Lot.Area) 0.1026436 0.0075547 13.587 < 2e-16 ***
NeighborhoodBlueste 0.0146631 0.0419500 0.350 0.726738
NeighborhoodBrDale -0.0634618 0.0347412 -1.827 0.067954 .
NeighborhoodBrkSide -0.0219612 0.0306986 -0.715 0.474491
NeighborhoodClearCr 0.0028061 0.0338084 0.083 0.933862
NeighborhoodCollgCr -0.0131967 0.0275606 -0.479 0.632136
NeighborhoodCrawfor 0.0479177 0.0311068 1.540 0.123679
NeighborhoodEdwards -0.1000400 0.0294309 -3.399 0.000695 ***
NeighborhoodGilbert -0.0379098 0.0287188 -1.320 0.187036
NeighborhoodGreens 0.0359676 0.0442891 0.812 0.416864
NeighborhoodGrnHill 0.4397167 0.0709210 6.200 7.38e-10 ***
NeighborhoodIDOTRR -0.1451682 0.0324860 -4.469 8.49e-06 ***
NeighborhoodLandmrk -0.0681952 0.0949941 -0.718 0.472943
NeighborhoodMeadowV -0.1266187 0.0383000 -3.306 0.000970 ***
NeighborhoodMitchel -0.0368471 0.0295088 -1.249 0.211987
NeighborhoodNAmes -0.0565614 0.0285542 -1.981 0.047801 *
NeighborhoodNoRidge 0.0623060 0.0303150 2.055 0.040034 *
NeighborhoodNPkVill -0.0208120 0.0384833 -0.541 0.588726
NeighborhoodNridgHt 0.0509080 0.0290237 1.754 0.079643 .
NeighborhoodNWAmes -0.0544560 0.0297828 -1.828 0.067693 .
NeighborhoodOldTown -0.1188641 0.0294370 -4.038 5.68e-05 ***
NeighborhoodSawyer -0.0229563 0.0298927 -0.768 0.442639
NeighborhoodSawyerW -0.0515292 0.0289758 -1.778 0.075559 .
NeighborhoodSomerst 0.0566938 0.0276847 2.048 0.040759 *
NeighborhoodStoneBr 0.0745779 0.0321385 2.321 0.020453 *
NeighborhoodSWISU -0.0604508 0.0340779 -1.774 0.076293 .
NeighborhoodTimber -0.0324647 0.0317804 -1.022 0.307176
NeighborhoodVeenker -0.0167331 0.0376356 -0.445 0.656671
Condition.1Artery -0.0760394 0.0151294 -5.026 5.64e-07 ***
Condition.1Feedr -0.0767356 0.0113456 -6.763 1.96e-11 ***
Condition.1Park 0.0069725 0.0182004 0.383 0.701706
Condition.1Rail -0.0496206 0.0149453 -3.320 0.000922 ***
Overall.Qual 0.0494652 0.0034280 14.430 < 2e-16 ***
Overall.Cond 0.0351455 0.0027881 12.605 < 2e-16 ***
HouseAge -0.0008503 0.0001883 -4.516 6.84e-06 ***
FoundationCBlock 0.0583435 0.0105020 5.555 3.30e-08 ***
FoundationPConc 0.0718433 0.0115567 6.217 6.66e-10 ***
FoundationSlab 0.0630207 0.0291017 2.166 0.030512 *
FoundationStone 0.0038723 0.0390916 0.099 0.921107
FoundationWood 0.0465776 0.0554929 0.839 0.401418
Bsmt.QualEx -0.7570306 0.1136937 -6.659 3.94e-11 ***
Bsmt.QualFa -0.8357544 0.1124732 -7.431 1.85e-13 ***
Bsmt.QualGd -0.8407558 0.1123115 -7.486 1.24e-13 ***
Bsmt.QualPo -0.8411315 0.1443930 -5.825 7.04e-09 ***
Bsmt.QualTA -0.8503101 0.1122157 -7.577 6.31e-14 ***
Bsmt.ExposureAv 0.1590409 0.0918543 1.731 0.083588 .
Bsmt.ExposureGd 0.1929930 0.0921545 2.094 0.036415 *
Bsmt.ExposureMn 0.1151061 0.0920620 1.250 0.211392
Bsmt.ExposureNo 0.1296663 0.0917945 1.413 0.158000
log(Total.Bsmt.SF) 0.1191674 0.0091824 12.978 < 2e-16 ***
Heating.QC 0.0091548 0.0033638 2.722 0.006577 **
Central.AirY 0.0587416 0.0121548 4.833 1.49e-06 ***
Baths 0.0470497 0.0045251 10.397 < 2e-16 ***
Bedroom.AbvGr -0.0140992 0.0042615 -3.308 0.000961 ***
Kitchen.AbvGr -0.0977644 0.0140392 -6.964 5.05e-12 ***
Kitchen.Qual 0.0346078 0.0056027 6.177 8.51e-10 ***
FunctionalMaj2 -0.1661268 0.0518564 -3.204 0.001387 **
FunctionalMin1 0.0143539 0.0345741 0.415 0.678085
FunctionalMin2 0.0352368 0.0342227 1.030 0.303356
FunctionalMod -0.0059146 0.0372751 -0.159 0.873948
FunctionalTyp 0.0882264 0.0312921 2.819 0.004877 **
Fireplaces 0.0278606 0.0047452 5.871 5.37e-09 ***
Garage.Cars 0.0400643 0.0046975 8.529 < 2e-16 ***
Paved.DriveP 0.0101521 0.0181899 0.558 0.576854
Paved.DriveY 0.0573398 0.0114111 5.025 5.67e-07 ***
log(Pool.Area) 0.0175609 0.0059007 2.976 0.002969 **
log(TotalSq) 0.3701184 0.0154033 24.029 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.09123 on 1425 degrees of freedom
Multiple R-squared: 0.9426, Adjusted R-squared: 0.9399
F-statistic: 349 on 67 and 1425 DF, p-value: < 2.2e-16
plot(model.BIC)
not plotting observations with leverage one:
638, 655, 990
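The commented-out `step()` calls earlier differ only in the penalty `k`: `k = 2` corresponds to AIC and `k = log(n)` to BIC, which penalizes each extra coefficient more heavily once n exceeds about 7. A minimal sketch of backward selection with both penalties, using the built-in `mtcars` data as a stand-in (an assumption for illustration):

```r
# Full model on the built-in mtcars data (illustration only)
full <- lm(mpg ~ ., data = mtcars)
n <- nrow(mtcars)

# Backward elimination; trace = 0 suppresses the step-by-step printout
fit_aic <- step(full, k = 2, trace = 0)
fit_bic <- step(full, k = log(n), trace = 0)

# Formulas of the selected models
formula(fit_aic)
formula(fit_bic)
```

Pasting the selected formula into an explicit `lm()` call, as done for model.BIC, avoids re-running the search every time the notebook is knit.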
# Exploring the remaining predictors' relationship to price
# plot(log(data_train$Lot.Area), log(data_train$price))
# plot((data_train$Neighborhood), log(data_train$price))
# plot((data_train$Condition.1), log(data_train$price))
# plot((data_train$Overall.Qual), log(data_train$price))
# plot((data_train$Overall.Cond), log(data_train$price))
# plot((data_train$HouseAge), log(data_train$price))
# plot((data_train$Bsmt.Qual), log(data_train$price))
# plot((data_train$Bsmt.Exposure), log(data_train$price))
# plot(log(data_train$Total.Bsmt.SF), log(data_train$price))
# plot((data_train$Heating.QC), log(data_train$price))
# plot((data_train$Central.Air), log(data_train$price))
# plot((data_train$Baths), log(data_train$price))
# plot((data_train$Bedroom.AbvGr), log(data_train$price))
# plot((data_train$Kitchen.AbvGr), log(data_train$price))
# plot((data_train$Kitchen.Qual), log(data_train$price))
# plot((data_train$Functional), log(data_train$price))
# plot((data_train$Fireplaces), log(data_train$price))
# plot((data_train$Paved.Drive), log(data_train$price))
# plot((data_train$Garage.Cars), log(data_train$price))
# plot(log(1+data_train$Pool.Area), log(data_train$price))
# plot(log(data_train$TotalSq), log(data_train$price))
termplot(model.BIC, partial.resid = TRUE, col.res = "purple", cex = 0.5,
         rug = TRUE, se = TRUE, smooth = panel.smooth)
# There are 3 high leverage points - may want to exclude them
hh<-hatvalues(model.BIC)
id<-which(hh==1)
plot(hatvalues(model.BIC), type = "h")
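Hat values are computed in floating point, so comparing them with a small tolerance is a safer way to flag leverage-one points than testing for exact equality. A common cause of leverage one is a factor level that occurs only once, since its coefficient is then determined by that single observation. The sketch below illustrates this on made-up data (`toy` and the tolerance `1e-8` are assumptions for illustration):

```r
# Made-up data in which factor level "c" occurs exactly once
toy <- data.frame(
  y = c(1.0, 2.1, 2.9, 4.2, 5.1, 9.0),
  x = c(1, 2, 3, 4, 5, 6),
  g = factor(c("a", "a", "a", "b", "b", "c"))
)

fit <- lm(y ~ x + g, data = toy)
h <- hatvalues(fit)

# Flag observations whose leverage equals one up to numerical tolerance
tol <- 1e-8
lev_one <- which(h > 1 - tol)
lev_one
```

Such an observation is fitted exactly, so its residual is zero and it carries no information about model fit. Whether to exclude it or to merge the rare level with another one is a modelling decision.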
## Prepare test data
load("ames_test.Rdata")
data=ames_test
data <- data %>%
#filter(!is.na(Lot.Frontage)) %>%
mutate(MS.SubClass= factor(MS.SubClass)) %>%
mutate(Alley = factor(Alley, levels = levels(addNA(Alley)), labels = c(levels(Alley), "None"), exclude = NULL)) %>%
mutate(HouseAge = Yr.Sold- pmax(Year.Built, Year.Remod.Add)) %>%
#filter(!is.na(Mas.Vnr.Area)) %>%
mutate(Bsmt.YN = 1*(!is.na(Bsmt.Qual))) %>%
mutate(Bsmt.Qual = factor(Bsmt.Qual, levels = levels(addNA(Bsmt.Qual)), labels = c(levels(Bsmt.Qual), "None"), exclude = NULL)) %>%
mutate(Bsmt.Qual = relevel(Bsmt.Qual, ref="None")) %>%
mutate(Bsmt.Cond = factor(Bsmt.Cond, levels = levels(addNA(Bsmt.Cond)), labels = c(levels(Bsmt.Cond), "None"), exclude = NULL)) %>%
mutate(Bsmt.Cond = relevel(Bsmt.Cond, ref="None")) %>%
mutate(Bsmt.Exposure = factor(Bsmt.Exposure, levels = levels(addNA(Bsmt.Exposure)), labels = c(levels(Bsmt.Exposure), "None"), exclude = NULL)) %>%
mutate(Bsmt.Exposure = relevel(Bsmt.Exposure, ref="None")) %>%
mutate(BsmtFin.Type.1= factor(BsmtFin.Type.1, levels = levels(addNA(BsmtFin.Type.1)), labels = c(levels(BsmtFin.Type.1), "None"), exclude = NULL)) %>%
mutate(BsmtFin.Type.1 = relevel(BsmtFin.Type.1, ref="None")) %>%
mutate(BsmtFin.Type.2= factor(BsmtFin.Type.2, levels = levels(addNA(BsmtFin.Type.2)), labels = c(levels(BsmtFin.Type.2), "None"), exclude = NULL)) %>%
mutate(BsmtFin.Type.2 = relevel(BsmtFin.Type.2, ref="None")) %>%
mutate(X12.SF= X1st.Flr.SF+ X2nd.Flr.SF) %>%
#filter(!is.na(Bsmt.Full.Bath)) %>%
#filter(!is.na(Bsmt.Half.Bath)) %>%
mutate(Baths = Bsmt.Full.Bath + 0.5*Bsmt.Half.Bath + Full.Bath + 0.5*Half.Bath) %>%
mutate(Fireplace.YN = 1*(Fireplaces>0)) %>%
mutate(Fireplace.Qu = factor(Fireplace.Qu, levels = levels(addNA(Fireplace.Qu)), labels = c(levels(Fireplace.Qu), "None"), exclude = NULL)) %>%
mutate(Fireplace.Qu = relevel(Fireplace.Qu, ref="None")) %>%
mutate(Garage.YN = 1*(!is.na(Garage.Cond))) %>%
mutate(Garage.Type = factor(Garage.Type, levels = levels(addNA(Garage.Type)), labels = c(levels(Garage.Type), "None"), exclude = NULL)) %>%
mutate(Garage.Type = relevel(Garage.Type, ref="None")) %>%
mutate(Garage.Finish = factor(Garage.Finish, levels = levels(addNA(Garage.Finish)), labels = c(levels(Garage.Finish), "None"), exclude = NULL)) %>%
mutate(Garage.Finish = relevel(Garage.Finish, ref="None")) %>%
mutate(Garage.Qual = factor(Garage.Qual, levels = levels(addNA(Garage.Qual)), labels = c(levels(Garage.Qual), "None"), exclude = NULL)) %>%
mutate(Garage.Qual = relevel(Garage.Qual, ref="None")) %>%
mutate(Garage.Cond = factor(Garage.Cond, levels = levels(addNA(Garage.Cond)), labels = c(levels(Garage.Cond), "None"), exclude = NULL)) %>%
mutate(Garage.Cond = relevel(Garage.Cond, ref="None")) %>%
mutate(Porch.Area = Wood.Deck.SF+ Open.Porch.SF+Enclosed.Porch+X3Ssn.Porch + Screen.Porch) %>%
mutate(Pool.YN = 1*(Pool.Area>0)) %>%
mutate(Pool.QC = factor(Pool.QC, levels = levels(addNA(Pool.QC)), labels = c(levels(Pool.QC), "None"), exclude = NULL)) %>%
mutate(Pool.QC = relevel(Pool.QC, ref="None")) %>%
mutate(Fence = factor(Fence, levels = levels(addNA(Fence)), labels = c(levels(Fence), "None"), exclude = NULL)) %>%
mutate(Misc.Feature = factor(Misc.Feature, levels = levels(addNA(Misc.Feature)), labels = c(levels(Misc.Feature), "None"), exclude = NULL)) %>%
mutate(Mo.Sold = as.factor(Mo.Sold)) %>%
mutate(Yr.Sold = as.factor(Yr.Sold)) %>%
dplyr::select(-Garage.Yr.Blt) %>%
mutate(Condition.1 = as.character(Condition.1)) %>%
mutate(Kitchen.Qual=plyr::mapvalues(Kitchen.Qual, from = c("Po", "Fa", "TA","Gd", "Ex" ), to = c("1", "2", "3", "4", "5"))) %>%
mutate(Kitchen.Qual = as.numeric(as.character(Kitchen.Qual))) %>%
mutate(Heating.QC=plyr::mapvalues(Heating.QC, from = c("Po", "Fa", "TA","Gd", "Ex" ), to = c("1", "2", "3", "4", "5"))) %>%
mutate(Heating.QC = as.numeric(as.character(Heating.QC))) %>%
mutate(Bsmt.Qual = droplevels(Bsmt.Qual)) %>%
mutate(Functional = droplevels(Functional)) %>%
mutate(Roof.Matl = droplevels(Roof.Matl))
ind_rail<-which(data$Condition.1=="RRNn" | data$Condition.1=="RRAn" | data$Condition.1=="RRNe" | data$Condition.1=="RRAe")
ind_park<-which(data$Condition.1=="PosN" | data$Condition.1=="PosA")
data$Condition.1[ind_rail]<-"Rail"
data$Condition.1[ind_park]<-"Park"
data = data %>%
mutate(Condition.1 = factor(Condition.1)) %>%
mutate(Condition.1 = relevel(Condition.1, ref="Norm"))
data_test=data
data_test$Bsmt.Exposure[which(data_test$Bsmt.Exposure=="")]<-"None"
data_test$Bsmt.Exposure<-droplevels(data_test$Bsmt.Exposure)
data_test$Pool.Area<-data_test$Pool.Area+1
data_test$Total.Bsmt.SF<-data_test$Total.Bsmt.SF+1
# extract the truth
Y = data_test$price
# make prediction based on specific model
Yhat = predict(model.BIC, newdata=data_test, interval="prediction")
# depending on the response transformation
Yhat = exp(Yhat)
# name dataframe as predictions! DO NOT CHANGE
predictions = as.data.frame(Yhat)
predictions$PID = data_test$PID
save(predictions, file="predict.Rdata")
performance(Y, Yhat)
# Lasso for the simplified model (from BIC)
library(glmnet)
X.train = model.matrix(log(price) ~ log(Lot.Area) + Neighborhood + Condition.1 +
Overall.Qual + Overall.Cond + HouseAge + Foundation + Bsmt.Qual +
Bsmt.Exposure + log(Total.Bsmt.SF) + Heating.QC + Central.Air +
Baths + Bedroom.AbvGr + Kitchen.AbvGr + Kitchen.Qual + Functional +
Fireplaces + Garage.Cars + Paved.Drive + log(Pool.Area) +
log(TotalSq), data=data_train)
X.test = model.matrix(log(price) ~ log(Lot.Area) + Neighborhood + Condition.1 +
Overall.Qual + Overall.Cond + HouseAge + Foundation + Bsmt.Qual +
Bsmt.Exposure + log(Total.Bsmt.SF) + Heating.QC + Central.Air +
Baths + Bedroom.AbvGr + Kitchen.AbvGr + Kitchen.Qual + Functional +
Fireplaces + Garage.Cars + Paved.Drive + log(Pool.Area) +
log(TotalSq), data=data_test)
model.lasso = glmnet(X.train, log(data_train$price), alpha=1)
cv.lasso = cv.glmnet(X.train, log(data_train$price), alpha=1)
yhat.lasso = exp(predict(model.lasso, s=cv.lasso$lambda.min, type="response", newx = X.test))
sqrt(mean(((yhat.lasso)-data_test$price)^2))
[1] 15559.69
# Ridge for the simplified model (from BIC)
model.ridge = glmnet(X.train, log(data_train$price), alpha=0)
cv.ridge = cv.glmnet(X.train, log(data_train$price), alpha=0)
yhat.ridge = predict(model.ridge, s=cv.ridge$lambda.min, type="response", newx = X.test)
sqrt(mean((exp(yhat.ridge)-data_test$price)^2))
[1] 16492.59
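Because X.train and X.test above are built by two separate model.matrix calls, their columns only line up if every factor has the same levels in both data sets. The toy example below (made-up data, not the Ames data) sketches the problem and one possible fix: give the test factor the training levels before building the matrix.

```r
# Toy data: level "c" of f occurs only in the training set
train <- data.frame(y = c(1, 2, 3, 4), x = c(0.5, 1.5, 2.5, 3.5),
                    f = factor(c("a", "b", "c", "a")))
test  <- data.frame(y = c(2, 3), x = c(1, 2), f = factor(c("a", "b")))

X.train <- model.matrix(y ~ x + f, data = train)
X.test  <- model.matrix(y ~ x + f, data = test)

# The two matrices have different numbers of columns
c(train = ncol(X.train), test = ncol(X.test))

# Re-declare the test factor with the training levels, then rebuild;
# the unused level becomes an all-zero column
test$f <- factor(test$f, levels = levels(train$f))
X.test.aligned <- model.matrix(y ~ x + f, data = test)

identical(colnames(X.train), colnames(X.test.aligned))
```

With aligned columns, the newx argument of predict on a glmnet fit receives a matrix of the shape the model was trained on.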
Your models will be evaluated on the following criteria on the test data:
* Bias: Average (Yhat-Y). Positive values indicate the model tends to overestimate price (on average), while negative values indicate the model tends to underestimate price.
* Maximum Deviation: Max |Y-Yhat|. Identifies the worst prediction made in the validation data set.
* Mean Absolute Deviation: Average |Y-Yhat|. The average error (regardless of sign).
* Root Mean Square Error: Sqrt Average (Y-Yhat)^2
* Coverage: Average( lwr < Y < upr)
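The five criteria can be computed directly from their definitions. The helper below is a hypothetical sketch: the name score_predictions and the toy numbers are illustrative, and it is not the course's scoring code.

```r
# Hypothetical helper; name and arguments are illustrative only.
# Y:    vector of true prices
# pred: data frame with columns fit, lwr, upr on the original price scale
score_predictions <- function(Y, pred) {
  err <- pred$fit - Y                 # Yhat - Y
  c(bias       = mean(err),
    maxDev     = max(abs(err)),
    meanAbsDev = mean(abs(err)),
    rmse       = sqrt(mean(err^2)),
    coverage   = mean(pred$lwr < Y & Y < pred$upr))
}

# Toy values; the third interval (310, 350) misses the true price of 300
Y    <- c(100, 200, 300, 400)
pred <- data.frame(fit = c(110, 190, 330, 400),
                   lwr = c( 90, 170, 310, 350),
                   upr = c(130, 210, 350, 450))
score_predictions(Y, pred)
```

For these toy values the errors are 10, -10, 30 and 0, so the bias is 7.5, the maximum deviation is 30, the mean absolute deviation is 12.5, the RMSE is sqrt(275), and the coverage is 0.75.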
In order to have a passing wercker badge, your file for predictions needs to be the same length as the test data, with three columns: fitted values, lower CI and upper CI values, in that order, named fit, lwr, and upr respectively.
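A quick self-check of the required layout might look like the sketch below. It uses the built-in mtcars data split into a toy training and test set (an assumption for illustration); predict with interval = "prediction" on an lm fit returns the columns fit, lwr and upr, which are then back-transformed from the log scale.

```r
# Toy split of mtcars standing in for the Ames training and test data
train_idx <- seq_len(24)
toy_train <- mtcars[train_idx, ]
toy_test  <- mtcars[-train_idx, ]

fit <- lm(log(mpg) ~ log(wt) + hp, data = toy_train)

# Prediction intervals on the log scale, then back to the original scale
pred_log    <- predict(fit, newdata = toy_test, interval = "prediction")
predictions <- as.data.frame(exp(pred_log))

# Format checks: one row per test observation, three named columns in order
stopifnot(nrow(predictions) == nrow(toy_test))
stopifnot(identical(names(predictions), c("fit", "lwr", "upr")))
stopifnot(all(predictions$lwr < predictions$fit & predictions$fit < predictions$upr))
head(predictions, 3)
```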
You will be able to see your scores on the scoreboard (coming soon!). They will be initialized by a prediction based on the mean in the training data.
Model Check - Test your prediction on the first observation in the training and test data set to make sure that the model gives a reasonable answer and include this in a supplement of your report. This should be done BY HAND using a calculator (this means use the raw data from the original dataset and manually calculate all transformations and interactions with your calculator)! Models that do not give reasonable answers will be given a minimum 2 letter grade reduction. Also be careful as you cannot use certain transformations [log or inverse x] if a variable has values of 0.
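The sketch below illustrates both points on made-up numbers (not the Ames data): why log fails on a zero value, and how a single prediction can be reproduced term by term from the coefficients, which is the same arithmetic the hand calculation requires.

```r
# log of zero is -Inf, so a variable with zeros needs a shift first
x <- c(0, 10, 250)
log(x)       # first element is -Inf
log(x + 1)   # shifting by one avoids the problem
log1p(x)     # base R shorthand for log(1 + x)

# Made-up toy data for the hand check
toy <- data.frame(price = c(100000, 150000, 210000, 260000, 330000),
                  area  = c(800, 1200, 1600, 2000, 2600),
                  pool  = c(0, 0, 0, 400, 0))
fit <- lm(log(price) ~ log(area) + log1p(pool), data = toy)

# Reproduce the prediction for the first observation from the coefficients
b     <- coef(fit)
first <- toy[1, ]
by_hand <- exp(b[1] + b[2] * log(first$area) + b[3] * log1p(first$pool))

# Compare against predict(); the two should agree
by_predict <- exp(predict(fit, newdata = first))
all.equal(unname(by_hand), unname(by_predict))
```

With a calculator, the by_hand line is the part to redo manually: transform each raw value, multiply by its coefficient, add the intercept, and exponentiate.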
In this part you may go all out for constructing a best fitting model for predicting housing prices using methods that we have covered this semester. You should feel free to create any new variables (such as quadratic, interaction, or indicator variables, splines, etc.). The variable TotalSq = X1st.Flr.SF+X2nd.Flr.SF was added to the dataframe (it does not include basement area, so you may improve on this). A relative grade is assigned by comparing your fit on the test set to that of your fellow students, with bonus points awarded to those who substantially exceed their fellow students and point reductions for models which fit exceedingly poorly.
# use data_test, which carries the derived columns (HouseAge, X12.SF, Baths, Bsmt.YN);
# with data = test the call failed because HouseAge was not found
X.test = model.matrix(log(price) ~ log(Lot.Area) + Neighborhood + Condition.1 +
Overall.Qual + Overall.Cond + HouseAge + Foundation + Total.Bsmt.SF +
Central.Air + log(X12.SF) + Baths + Kitchen.AbvGr + Kitchen.Qual +
Functional + Fireplaces + Garage.Cars + Paved.Drive + Bsmt.YN:Bsmt.Exposure,
data = data_test)
Update your predictions using your complex model to provide point estimates and CI.
You may iterate here as much as you like exploring different models until you are satisfied with your results.
Bayesian lasso
library(glmnet)
library(monomvn)
library(MASS) # provides lm.ridge
housing.ridge = lm.ridge(log(price) ~ (MS.SubClass + MS.Zoning + log(Lot.Frontage) + log(Lot.Area) + Street + Alley + Lot.Shape + Land.Contour + Lot.Config + Land.Slope + Neighborhood + Condition.1 + Bldg.Type + House.Style + Overall.Qual + Overall.Cond + HouseAge + Roof.Style + Roof.Matl + Exterior.1st + Exterior.2nd + Mas.Vnr.Type + log(1+Mas.Vnr.Area) + Exter.Cond + Exter.Cond + Foundation + Bsmt.YN:Bsmt.Qual + Bsmt.YN:Bsmt.Cond + Bsmt.YN:Bsmt.Exposure + Total.Bsmt.SF + Heating + Heating.QC + Central.Air + Electrical + log(X12.SF) + log(1+Low.Qual.Fin.SF) + Baths + Bedroom.AbvGr + Kitchen.AbvGr + Kitchen.Qual + Functional + Fireplaces + Fireplace.YN:Fireplace.Qu + Garage.YN:Garage.Type + Garage.YN:Garage.Finish + Garage.Cars + Garage.YN:Garage.Cond + Garage.YN:Garage.Qual + Paved.Drive + log(1+Pool.Area) + Pool.YN:Pool.QC + Fence + Misc.Val + Mo.Sold +Yr.Sold + Sale.Type + TotalSq)^2,
lambda = seq(0, 5, 0.0001), data = data)
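# the formula operators * and : do not work inside mutate(); interaction() builds the combined factor
data2 = data %>% mutate(bsmt.qual = interaction(Bsmt.YN, Bsmt.Qual, drop = TRUE),
bsmt.cond = interaction(Bsmt.YN, Bsmt.Cond, drop = TRUE),
bsmt.exposure = interaction(Bsmt.YN, Bsmt.Exposure, drop = TRUE))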
load("~/Desktop/STA 521/Project_homoBayesians/ames_test.Rdata")
ctest = ames_test
ctrain = data
# ridge:
housing.ridge = lm.ridge(log(price) ~ (MS.SubClass + MS.Zoning+ log(Lot.Frontage) + log(Lot.Area) + Street)^2,
lambda = seq(0, 5, 0.0001), data = data)
best.lambda = housing.ridge$lambda[which.min(housing.ridge$GCV)]
# lasso:
# X and Y:
X = model.matrix(log(price) ~ (MS.SubClass + MS.Zoning + log(Lot.Frontage) + log(Lot.Area) + Street + Alley + Lot.Shape + Land.Contour + Lot.Config + Land.Slope + Neighborhood + Condition.1 + Bldg.Type + House.Style + Overall.Qual + Overall.Cond + HouseAge + Roof.Style + Roof.Matl + Exterior.1st + Exterior.2nd + Mas.Vnr.Type + log(1+Mas.Vnr.Area) + Exter.Cond + Exter.Cond + Foundation + Total.Bsmt.SF + Heating + Heating.QC + Central.Air + Electrical + log(X12.SF) + log(1+Low.Qual.Fin.SF) + Baths + Bedroom.AbvGr + Kitchen.AbvGr + Kitchen.Qual + Functional + Fireplaces + Garage.Cars + Paved.Drive + log(1+Pool.Area) + Fence + Misc.Val + Mo.Sold +Yr.Sold + Sale.Type + TotalSq)^2
+ Bsmt.YN:Bsmt.Qual + Bsmt.YN:Bsmt.Cond + Bsmt.YN:Bsmt.Exposure + Fireplace.YN:Fireplace.Qu + Garage.YN:Garage.Type + Garage.YN:Garage.Finish + Garage.YN:Garage.Cond + Garage.YN:Garage.Qual + Pool.YN:Pool.QC,
data = data)
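# response on the log scale, matching log(price) in the formula above and the exp() applied to predictions;
# Y must have one row per row of X (model.matrix drops rows with missing predictors)
Y = as.matrix(log(data$price))
# cross-validated lasso: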
lasso.cv = cv.glmnet(x = X, y = Y, alpha=1, lambda = seq(0, 2, 0.001), family = "gaussian")
lambda.lasso = lasso.cv$lambda.min
# lasso with the best lambda
housing.lasso = glmnet(X, Y, alpha=1, lambda = lambda.lasso, family = "gaussian")
coef(housing.lasso)
#SCALED??
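# drop the intercept column before scaling: scale() returns NaN for a constant column,
# and blasso fits its own intercept by default
X.scaled = scale(X[, -1])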
housing.blasso.RJ = blasso(X.scaled, Y, RJ = TRUE, verb=0)
summary(housing.blasso.RJ)$coef
# prediction:
# ypred.blasso.RJ = mean(college.blasso.RJ$mu) + as.matrix(test.scaled) %*% apply(college.blasso.RJ$beta, 2, mean)
# rmse.blasso.RJ = rmse(exp(ctest$Apps), exp(ypred.blasso.RJ))
# center and scale X
X.scaled = X
X.scaled[,2:15] = scale(X[,2:15], center =college.ridge.cv$xm[c(-1,-16)],
scale = college.ridge.cv$scales[c(-1,-16)])
# RJ = FALSE
college.blasso = blasso(X.scaled, Y, RJ = FALSE, verb=0)
ypred.blasso = mean(college.blasso$mu) +
as.matrix(test.scaled) %*% apply(college.blasso$beta, 2, mean)
rmse.blasso = rmse(exp(ctest$Apps), exp(ypred.blasso))
# model selection with RJ = TRUE
college.blasso.RJ = blasso(X.scaled, Y, RJ = TRUE, verb=0)
ypred.blasso.RJ = mean(college.blasso.RJ$mu) + as.matrix(test.scaled) %*% apply(college.blasso.RJ$beta, 2, mean)
rmse.blasso.RJ = rmse(exp(ctest$Apps), exp(ypred.blasso.RJ))
summary(college.blasso.RJ)$coef
Once you are satisfied with your model, provide a write up of your data analysis project in a new Rmd file/pdf file: writeup.Rmd by copying over salient parts of your R notebook. The written assignment consists of five parts:
Exploratory data analysis (20 points): must include three correctly labeled graphs and an explanation that highlight the most important features that went into your model building.
Development and assessment of an initial model from Part I (10 points)
Initial model: must include a summary table and an explanation/discussion for variable selection. Interpretation of coefficients desirable for full points.
Model selection: must include a discussion
Residual: must include a residual plot and a discussion
RMSE: must include an RMSE and an explanation (other criteria desirable)
Model testing: must include an explanation
Final model: must include a summary table
Variables: must include an explanation
Variable selection/shrinkage: must use appropriate method and include an explanation
Residual: must include a residual plot and a discussion
RMSE: must include an RMSE and an explanation (other criteria desirable)
Model evaluation: must include an evaluation discussion
Model testing: must include a discussion
Model result: must include a selection of the top 10 undervalued and overvalued houses
Create predictions for the validation data from your final model and write them out to a file, prediction-validation.Rdata. This should have the same format as the models in Part I and II.
10 points
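For the "Model result" item above, one possible reading is to rank houses by the gap between sale price and model prediction on the log scale: a large negative residual means the house sold for less than the model predicts (a candidate for "undervalued"), and a large positive residual the opposite. The sketch below uses simulated data and a one-predictor model purely for illustration; the variable names and the ranking rule are assumptions, not a prescribed method.

```r
# Simulated stand-in for the housing data (not the Ames data)
set.seed(521)
n <- 200
houses <- data.frame(PID = seq_len(n), area = round(runif(n, 600, 3000)))
houses$price <- round(exp(7 + 0.7 * log(houses$area) + rnorm(n, sd = 0.15)))

fit <- lm(log(price) ~ log(area), data = houses)

# Residual on the log scale: actual minus predicted
houses$predicted <- exp(fitted(fit))
houses$log_resid <- log(houses$price) - log(houses$predicted)

# Sort from most negative to most positive residual
ranked <- houses[order(houses$log_resid), ]
undervalued <- head(ranked, 10)   # sold furthest below the predicted price
overvalued  <- tail(ranked, 10)   # sold furthest above the predicted price

undervalued[, c("PID", "price", "predicted", "log_resid")]
```

Ranking on the log scale compares houses by percentage gap rather than dollar gap, so expensive houses do not dominate the list simply because of their size.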
Each group should prepare 5 slides in their GitHub repo (save as slides.pdf):
Most interesting graphic (a picture is worth a thousand words prize!)
Best Model (motivation, how you found it, why you think it is best)
Best Insights into predicting Sales Price.
2 Best Houses to purchase (and why)
Best Team Name/Graphic
We will select winners based on the above criteria and overall performance.
Finally, your repo should have: writeup.Rmd, writeup.pdf, slides.Rmd (and whatever output you use for the presentation), predict.Rdata, and predict-validation.Rdata.